POLI 572B

Michael Weaver

February 3, 2020

Least Squares

Objectives

Mechanics of Bivariate Regression

  • Relationship between variables
    • Covariance; Correlation
  • the mean (again)
  • conditional expectation function
  • “Ordinary Least Squares”
    • algorithm
    • mathematical properties of algorithm
    • no assumptions

Introduction:

Why least squares?

  • A “simple”, flexible tool; many applications
  • Ubiquitous, “good enough”

Application to causality

  • mean of \(Y\) across different values of \(Z\).
    • with assumptions, can give a causal interpretation.
  • mean of \(Y\) across values of \(D\), conditioning on \(X\)
    • with more assumptions; can obtain “ignorability”

Today

No causality

  • no effort to prove causality; no assumptions about causal model.

No parameters

  • no statistical model; no random variables; no parameters to estimate

Regression/Least Squares as Algorithm

  • algorithm with mathematical properties
  • without further assumption, plug in data, generate a line
  • limited interpretation

Today

  • Association between two variables

  • The Mean

  • Regression/Least Squares extends the mean

Basic Concepts

Covariance and Correlation

How are these variables associated?

Covariance and Correlation

How are these variables associated?

Covariance and Correlation

Covariance

\[Cov(X,Y) = \frac{1}{n}\sum\limits_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})\]

\[Cov(X,Y) = \overline{xy} - \bar{x}\bar{y}\]

  • Divide by \(n-1\) for sample covariance.

Covariance and Correlation

Variance

Variance is also the covariance of a variable with itself:

\[Var(X) = \frac{1}{n}\sum\limits_{i=1}^n(x_i - \bar{x})^2\]

\[Var(X) = \overline{x^2} - \bar{x}^2\]

Covariance of tree width and tree height:

x1 = trees$Girth
y1 = trees$Height
mean(x1*y1) - (mean(x1)*mean(y1))
## [1] 10.04839

Covariance of tree width and timber volume:

x2 = trees$Girth
y2 = trees$Volume
mean(x2*y2) - (mean(x2)*mean(y2))
## [1] 48.27882
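As a sanity check (not in the original slides): R's built-in `cov()` divides by \(n-1\), so rescaling it by \((n-1)/n\) should recover the population formula used above.

```r
x <- trees$Girth
y <- trees$Height
n <- length(x)

# Population covariance via the shortcut formula above
pop_cov <- mean(x * y) - mean(x) * mean(y)

# R's cov() divides by n - 1; rescale to compare
all.equal(pop_cov, cov(x, y) * (n - 1) / n)
```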

Why is the \(Cov(Girth,Volume)\) larger?

Why is the \(Cov(Girth,Volume)\) larger?

  • Scale of covariance reflects scale of the variables.

  • Can’t directly compare the two covariances

Covariance: Intuition

Correlation

Correlation puts covariance on a standard scale

Covariance

\(Cov(X,Y) = \frac{1}{n}\sum\limits_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})\)

Pearson Correlation

\(r(X,Y) = \frac{Cov(X,Y)}{SD(X)SD(Y)}\)

  • Dividing by product of standard deviations scales the covariance

  • \(|Cov(X,Y)| \leq \sqrt{Var(X) \cdot Var(Y)}\)
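A quick check of this rescaling in R (the \(n-1\) factors in the sample covariance and standard deviations cancel):

```r
x <- trees$Girth
y <- trees$Volume

# Correlation = covariance rescaled by the product of standard deviations
r <- cov(x, y) / (sd(x) * sd(y))
all.equal(r, cor(x, y))
```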

Correlation

  • Correlation coefficient must be between \(-1, 1\)

  • At \(-1\) or \(1\), all points are on a straight line

  • Negative value implies increase in \(X\) associated with decrease in \(Y\).

  • If correlation is \(0\), the covariance must be?

  • If \(Var(X)=0\), then \(Cor(X,Y) = ?\)

Correlation: Interpretation

  • Correlation of \((x,y)\) is the same as correlation of \((y,x)\)

  • Values closer to -1 or 1 imply “stronger” association

  • Correlations cannot be understood using ratios.
    • Correlation of \(0.8\) is not “twice” as correlated as \(0.4\).
  • Pearson correlation is blind to outliers and nonlinearities; it measures only linear association
  • Correlation is not causation

Practice:

In groups of 2-3:

Without using cor() or cov() or var() functions in R

  1. Calculate mean of \(X\); mean of \(Y\)
  2. Calculate \(Var(X)\) and \(Var(Y)\)
  3. Calculate correlation \((X,Y)\)

Practice:

Covariance and Correlation

  • Both measure linear association of two variables
  • Scale is either standardized (correlation) or in terms of products (covariance)
  • May be inappropriate in presence of non-linearity; outliers

The Mean: Revisited

Squared Deviations

Why are we always squaring differences?

Variance

\(\frac{1}{n}\sum\limits_{i = 1}^{n} (x_i - \bar{x})^2\)

Covariance

\(\frac{1}{n}\sum\limits_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})\)

Mean Squared Error

\(\frac{1}{n}\sum\limits_{i=1}^n(\hat{y_i} - y_i)^2\)

Squared Deviations

It is about distance

What is the distance between two points?

In \(2\) dimensional space: \((p_1,p_2)\), \((q_1,q_2)\)

\[d(p,q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2}\]


In \(k\) dimensional space: \((p_1,p_2, \ldots, p_k)\), \((q_1,q_2, \ldots ,q_k)\)

\(d(p,q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \ldots + (p_k - q_k)^2}\)

What is the distance between two points?

What is the distance between two points?

\(p = (3,0); q = (0,4)\)

\[d(p,q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2}\] \[d(p,q) = \sqrt{(3 - 0)^2 + (0 - 4)^2}\] \[d(p,q) = \sqrt{3^2 + (-4)^2} = \ ?\]
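Checking the 3-4-5 example numerically in R:

```r
p <- c(3, 0)
q <- c(0, 4)

# Euclidean distance: square the coordinate differences, sum, take the square root
d <- sqrt(sum((p - q)^2))
d  # 5
```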

What is the distance between two points?

Remember Pythagoras?

What is the distance between two points?

In \(2\) dimensional space: \((p_1,p_2)\), \((q_1,q_2)\)

\[d(p,q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2}\]


In \(k\) dimensional space: \((p_1,p_2, \ldots, p_k)\), \((q_1,q_2, \ldots ,q_k)\)

\(d(p,q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \ldots + (p_k - q_k)^2}\)

The Mean

The mean minimizes the variance.

  • we saw this before (why we divide by \(n-1\) when estimating the population variance)
  • this is the same as minimizing the distance, because variance is mathematically linked to the distance calculation

The Mean

The mean minimizes the variance.

  • we saw this before (why we divide by \(n-1\) when estimating the population variance)

If we observe values \(y_i\) of a variable \(Y\), we choose a single number, \(\hat{y}\), to be our estimate for each value of \(y\):


the mean is the estimate that minimizes the distance between \(\hat{y}\) and each of the \(y_i\)s.
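A numerical sketch of this claim (not in the original slides): searching over candidate values of \(\hat{y}\), the sum of squared deviations is smallest at the mean.

```r
y <- trees$Height

# Sum of squared deviations for a candidate prediction yhat
sse <- function(yhat) sum((y - yhat)^2)

# Search for the minimizer within the range of the data
opt <- optimize(sse, interval = range(y))
all.equal(opt$minimum, mean(y), tolerance = 1e-3)
```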

Deriving the mean:

Imagine we have a variable \(Y\) that we observe as a sample of size \(n\). We can represent this variable as a vector in \(n\) dimensional space.

\[y = \begin{pmatrix}3 \\ 5 \end{pmatrix}\]

What is a vector?

  • a vector is a one-dimensional array of numbers of length \(n\).

  • can be portrayed graphically as an arrow from the origin to a point in \(n\) dimensional space

  • can be multiplied by a number to extend/shorten their length

Deriving the mean:

Deriving the mean:

We want to pick one number (a scalar) \(\hat{y}\) to predict all of the values in our vector \(y\).

This is equivalent to doing this:

\[y = \begin{pmatrix}3 \\ 5 \end{pmatrix} \approx \begin{pmatrix}\hat{y} \\ \hat{y} \end{pmatrix} = \hat{y} \begin{pmatrix}1 \\ 1 \end{pmatrix}\]

Multiplying a (1,1) vector by a constant.

Choose \(\hat{y}\) on the blue line at point that minimizes the distance to \(y\).

Deriving the mean:

\(y = \begin{pmatrix}3 \\ 5 \end{pmatrix}\)

can be decomposed into two separate vectors: a vector containing our prediction (\(\hat{y}\)):

\(\begin{pmatrix} \hat{y} \\ \hat{y} \end{pmatrix} = \hat{y} \begin{pmatrix} 1 \\ 1 \end{pmatrix}\)

and another vector \(\mathbf{e}\), which is difference between the prediction vector and the vector of observations:

\(\mathbf{e} = \begin{pmatrix}3 \\ 5 \end{pmatrix} - \begin{pmatrix} \hat{y} \\ \hat{y} \end{pmatrix}\)

Deriving the mean:

This means our goal is to minimize the length of \(\mathbf{e}\).

How do we find the closest distance? The length of \(\mathbf{e}\) is calculated by taking:

\[len(\mathbf{e})= \sqrt{(3-\hat{y})^2 + (5 - \hat{y})^2}\]

When is the length of \(\mathbf{e}\) minimized?

  • when angle between \(\hat{y} \begin{pmatrix} 1 \\ 1 \end{pmatrix}\) and \(\mathbf{e}\) is \(90^{\circ}\).
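With the slide's vector \(y = (3, 5)\), choosing \(\hat{y} = \bar{y} = 4\) makes \(\mathbf{e}\) orthogonal to the \((1,1)\) direction, as a dot product confirms:

```r
y <- c(3, 5)
yhat <- mean(y)   # 4
e <- y - yhat     # (-1, 1)

# Dot product with the (1, 1) vector is zero: the vectors are orthogonal
sum(e * c(1, 1))  # 0
```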

Deriving the mean:

Deriving the mean:

Values of a sample of size \(n\) are represented by a vector in \(n\) dimensional space.

  • We choose a value \(\hat{y}\) in a one-dimensional sub-space (the line through \(\begin{pmatrix} 1 & 1 & \ldots & 1 \end{pmatrix}\))

  • Such that \(\hat{y}\) has the smallest distance to \(y\).

Conditional Expectation Function

Generalizing the Mean:

The mean is useful…

… but often we want to know if the mean of something \(Y\) is different across different values of something else \(X\).

To put it another way: the mean of \(Y\) is the \(E(Y)\) (if we are talking about random variables). Sometimes we want to know \(E(Y | X)\)

  • simplest version could be difference in means (experiments)

Generalizing the Mean:

We are interested in finding some conditional expectation function (Angrist and Pischke)

expectation: because it is about the mean - \(E(Y)\)

conditional: because it is conditional on values of \(X\): \(E[Y | X]\)

function: because \(E(Y) = f(X)\), there is some relationship we can look at between values of \(X\) and \(E(Y)\).

\[E[Y | X = x]\]

Generalizing the Mean:

There are many ways to get the conditional expectation function

  • one easy way is to assume that the CEF is linear.
  • That is to say \(E(Y)\) is linear in \(X\).
  • The function takes the form of an equation of a line.

Equation of a line

Equation of a line

Equation of a line

\(slope = \frac{rise}{run} = \frac{y_2-y_1}{x_2-x_1}\)
  • Change in \(y\) with a 1 unit change in \(x\).

Equation of a line

Equation of a line

\(intercept = (y | x=0)\)
  • Value of \(y\) when \(x = 0\). Where the line crosses the \(y\)-axis.

Equation of a line

\(y = intercept + slope*x\)

or, by convention:

\(y = a + bx\)

Generalizing the Mean:

How do we choose the line that best captures:

\[E(Y) = a + b\cdot X\]

What line fits this?

Which line?

Which line?

Which line?

Graph of Averages

Which line?

The red line above is the regression line or the fit using least squares.

It closely approximates the conditional mean of son’s height (\(Y\)) across values of father’s height (\(X\)).

How do we obtain this line mathematically?

We can do it the same way we obtained the mean!
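We don't have the father/son height data here, but the "graph of averages" idea can be sketched with the built-in trees data: compute the mean of \(Y\) within bins of \(X\) (the five-bin choice is arbitrary, for illustration only).

```r
x <- trees$Girth
y <- trees$Volume

# Conditional means of y within 5 bins of x: a crude "graph of averages"
bins <- cut(x, breaks = 5)
tapply(y, bins, mean)
```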

Minimizing the Distance

We are going to choose an intercept \(a\) and slope \(b\) such that:

\(\hat{y}_i = a + b \cdot x_i\)

and that minimizes the distance between the fitted (\(\hat{y}_i\)) and true (\(y_i\)) values:

\(\sqrt{\sum\limits_i^n (y_i - \hat{y_i})^2}\)

Minimizing the Distance

Another way of thinking of this is in terms of residuals, or the difference between true and fitted values using the equation of the line.

\(e_i = y_i - \hat{y_i}\)

Minimizing the distance also means minimizing the sum of squared residuals \(e_i\)

  • this is exactly what we did above with the mean

Minimizing the Distance

We need to solve this equation:

\[\min_{a,b} \sum\limits_i^n (y_i - a - b x_i)^2\] Choose \(a\) and \(b\) to minimize this value, given \(x_i\) and \(y_i\)

We can do this with calculus: solve for when first derivative is \(0\)
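Before the calculus, a numerical sketch: we can hand this minimization directly to `optim()` and compare the result to R's `lm()` (using the trees data purely as an illustration).

```r
x <- trees$Girth
y <- trees$Volume

# Sum of squared residuals for a candidate (a, b)
ssr <- function(par) sum((y - par[1] - par[2] * x)^2)

# Numerical minimization (Nelder-Mead, from an arbitrary start)
fit <- optim(c(0, 0), ssr)

# Compare to the closed-form least squares solution
rbind(optim = fit$par, lm = unname(coef(lm(y ~ x))))
```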

Minimizing the Distance

First, we take the derivative with respect to \(a\), which yields:

\(-2 \left[ \sum\limits_i^n (y_i - a - b x_i) \right] = 0\)

\(\sum\limits_i^n y_i - \sum\limits_i^n a - \sum\limits_i^n b x_i = 0\)

\(-\sum\limits_i^n a = -\sum\limits_i^n y_i + \sum\limits_i^n b x_i\)

\(\sum\limits_i^n a = \sum\limits_i^n (y_i - b x_i)\)

Minimizing the Distance

\(\sum\limits_i^n a = \sum\limits_i^n (y_i - b x_i)\)

Dividing both sides by \(n\), we get:

\(a = \bar{y} - b\bar{x}\)

Where \(\bar{y}\) is mean of \(y\) and \(\bar{x}\) is mean of \(x\).

Implication: regression line goes through the point of averages \(\bar{y} = a + b \bar{x}\)

Minimizing the Distance

Next, we take derivative with respect to \(b\):

\(-2 \left[ \sum\limits_i^n (y_i - a - b x_i) x_i\right] = 0\)

\(\sum\limits_i^n (y_i - (\bar{y} - b\bar{x}) - b x_i) x_i = 0\)

\(\sum\limits_i^n (y_ix_i - \bar{y}x_i + b\bar{x}x_i - b x_i^2) = 0\)

Minimizing the Distance

\(\sum\limits_i^n (y_i - \bar{y})x_i = b\sum\limits_i^n (x_i - \bar{x})x_i\)

Dividing both sides by \(n\) gives us:

\(\frac{1}{n}\sum\limits_i^n (y_ix_i - \bar{y}x_i) = b\frac{1}{n}\sum\limits_i^n (x_i^2 - \bar{x}x_i)\)

\(\overline{yx} - \bar{y}\bar{x} = b\left(\overline{xx} - \bar{x}\bar{x}\right)\)

\(Cov(y,x) = b \cdot Var(x)\)

\(\frac{Cov(y,x)}{Var(x)} = b\)

Deriving Least Squares

The slope:

\[b = \frac{Cov(x,y)}{Var(x)} = r \frac{SD_y}{SD_x}\]

  • Expresses how much the mean of \(Y\) changes for a 1-unit change in \(X\)
  • When expressed as a function of the correlation coefficient \(r\), we see this is the rise (\(SD_y\)) over the run (\(SD_x\)), scaled by \(r\)

The Intercept:

\[a = \overline{y} - \overline{x}\cdot b\]

Shows us that at \(\bar{x}\), the line goes through \(\bar{y}\). The regression line (of predicted values) goes through the point \((\bar{x}, \bar{y})\) or the point of averages.
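These formulas can be checked directly in R; to avoid spoiling the exercise below, this sketch uses Height rather than Volume as \(Y\).

```r
x <- trees$Girth
y <- trees$Height

# Slope and intercept from the formulas (sample cov/var: the 1/(n-1) factors cancel)
b <- cov(x, y) / var(x)
a <- mean(y) - b * mean(x)

# Same answer as R's lm()
all.equal(unname(coef(lm(y ~ x))), c(a, b))
```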

Deriving Least Squares

There are other ways to derive least squares.

  • Because we eventually want multiple variables, we need to build intuition rooted in vectors/matrices and their relationship to distance.
  • Adding more variables turns this into minimizing distance in an \(n > 3\) dimensional space, which is hard to visualize.
  • Nevertheless, this exercise is helpful.

Key facts about regression:

The math of regression ensures that:

\(1\). The mean of the residuals is always zero. Because we included an intercept (\(a\)), and the regression line goes through the point of averages, the mean of the residuals is always \(0\): \(\overline{e} = 0\). This is also true of residuals from the mean.

\(2\). \(Cov(X,e) = 0\). This is true by definition of how we derived least squares. We will see why this is next week. But we can also prove it in class.

  • Also means: Correlation of \(X\) and residuals \(e\) is exactly \(0\). Why is this?

\(3\). \(Var(X) > 0\) in order to compute \(a\) and \(b\). Why is this?
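Facts 1 and 2 can be verified on any dataset; here, the built-in cars data (so the trees exercise below stays intact):

```r
fit <- lm(dist ~ speed, data = cars)
e <- resid(fit)

# Fact 1: residuals average to zero (up to floating-point error)
abs(mean(e)) < 1e-10

# Fact 2: residuals are uncorrelated with x
abs(cor(cars$speed, e)) < 1e-10
```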

In small groups:

Using the trees data in R:

Take \(Y = Volume\) and \(X = Girth\)

  1. Calculate \(a\) and \(b\) using the equations given above
  2. Interpret \(a\) in terms of CEF
  3. What is \(\hat{y}\) when Girth is 10? What does this mean in terms of the CEF?
  4. Calculate \(e\)
  5. Calculate the mean of \(e\)
  6. Calculate the correlation of \(X\) and \(e\)

Summary

Key ideas

These facts are mathematical truths about least squares, unrelated to the assumptions needed for statistical/causal inference.

  • Can fit least squares to any scatterplot (regardless of how sensical it is), if \(x\) has positive variance.

  • Least squares line minimizes the sum of squared residuals (minimizes the distance between predicted values and actual values of \(y\)).

  • Least Squares line always goes through point of averages; can be computed exactly from “graph of averages”

  • Residuals \(e\) are always uncorrelated with \(x\) if there is an intercept, because they are orthogonal to \(x\) and have mean \(0\).

Key ideas

  • We have not addressed in any way how this relates to a statistical model or a causal model.
  • A linear conditional expectation function should be evaluated visually (does it make sense?)
  • If the conditional expectation function is non-linear (e.g. a “U” shape in the mean of \(y\) across values of \(x\)), linear regression gives the best linear approximation to it.